Automated hippocampal segmentation algorithms evaluated in stroke patients

Deep learning segmentation algorithms can produce reproducible results in a matter of seconds. However, their application to more complex datasets is uncertain and may fail in the presence of severe structural abnormalities—such as those commonly seen in stroke patients. In this investigation, six recent, deep learning-based hippocampal segmentation algorithms were tested on 641 stroke patients of a multicentric, open-source dataset ATLAS 2.0. The comparisons of the volumes showed that the methods are not interchangeable with concordance correlation coefficients from 0.266 to 0.816. While the segmentation algorithms demonstrated an overall good performance (volumetric similarity [VS] 0.816 to 0.972, DICE score 0.786 to 0.921, and Hausdorff distance [HD] 2.69 to 6.34), no single out-performing algorithm was identified: FastSurfer performed best in VS, QuickNat in DICE and average HD, and Hippodeep in HD. Segmentation performance was significantly lower for ipsilesional segmentation, with a decrease in performance as a function of lesion size due to the pathology-based domain shift. Only QuickNat showed a more robust performance in volumetric similarity. Even though there are many pre-trained segmentation methods, it is important to be aware of the possible decrease in performance for the segmentation results on the lesion side due to the pathology-based domain shift. The segmentation algorithm should be selected based on the research question and the evaluation parameter needed. More research is needed to improve current hippocampal segmentation methods.

training [17][18][19][20][21][22] . Unfortunately, these networks are mainly trained on healthy volunteers or special patient groups such as Alzheimer's patients and are not explicitly designed to account for other brain disorders. Large brain lesions common to stroke patients represent a domain shift for these pre-trained segmentation networks 23,24 , which may cause a significant drop in the segmentation performance.
Not only is the training of these networks based on pre-annotated datasets, but the evaluation of segmentation results is traditionally based on agreement measures with reference segmentation. However, most large datasets lack this reference 'ground truth' segmentation, making conventional performance measurement evaluation impossible. For this work, we generated a "virtual" ground truth segmentation based on a consensus method using simultaneous truth and performance level estimation (STAPLE) algorithm. Omitting manual pre-labeled data will lead to more objective and reproducible results.
In this study, we explored the generalizability of recent, pretrained, open-source, deep learning-based hippocampal segmentation networks. We introduced a domain shift by changing the population to a new and unseen dataset with chronic stroke lesions to test the cross-domain transferability.
The aim was to analyze the segmentation performance with common evaluation metrics based on an agreement approach: (1) by ranking the algorithms to a virtually generated ground truth segmentation using the STA-PLE algorithms and (2) to visualize (dis)similarities in a pairwise comparison using mean values with a metric multidimensional scaling approach. (3) In a subgroup analysis we further analyzed whether the presence of the stroke lesion negatively impacted the segmentation performance. The robustness of the segmentation results was evaluated by the correlation of evaluation metrics and stroke volume.

Results
Dataset. We included n = 641/655 patients (97.86%) from n = 33 institutes. The remaining n = 14 patients (2.14%) were excluded due to movement artifacts (n = 3), due to inadequate image resolution (n = 7), and due to hippocampal involvement of the stroke lesion (n = 4). An overview of the ATLAS dataset can be found in Table 1.
In total the automatic segmentation failed for n = 53 cases from n = 48 patients, n = 8 for the e2dhipseg (n = 5 on the ipsilesional side), and n = 45 for the HippMapp3r algorithms (n = 32 on the ipsilesional side). All other algorithms showed a segmentation success rate of 100%. No erroneous STAPLE masks were detected after the visual inspection. Figure 1 depicts the hippocampal volumes as well as the concordance correlation coefficients of each segmentation algorithm with the virtual generated STAPLE ground truth. FastSurfer segmentation showed an excellent agreement in volume with the STAPLE ground truth and a concordance correlation coefficient of 0.85. Three out of six comparisons revealed a good agreement (Hippodeep, QuickNat, and AssemblyNet) and the remaining two algorithms with a fair agreement.

Volumes of segmentation and STAPLE masks.
The interpretation of all pairwise comparisons allows conclusions about agreement across methods. Only the comparison between Hippodeep and the AssemblyNet segmentations revealed an excellent agreement with a CCC of 0.815. In total seven (33.3%) pairs out of all 21 comparisons had a good agreement, six (28.6%) with a poor agreement and six (28.6%) with a moderate agreement (in detail, see Fig. 1b). The best-performing algorithm in DICE score was QuickNat (mean = 0.939, sd = 0.056) with statistically significant better performance than the second-best performing algorithm FastSurfer with a mean difference of 0.011, 95% CI [0.006, 0.015], t(1281) = 4.55, p-value < 0.0001.
Instance-based classification revealed Hippodeep, FastSurfer, and QuickNat as the three most similar algorithms to the STAPLE masks in volumetric similarity, DICE score, and average HD (Fig. 2). In general, DICE scores and average HD revealed similar ranking distributions (Fig. 2b,c) showing QuickNat as the most similar segmentation to the STAPLE mask. The analysis of the HD95 showed a more heterogeneous result with the lowest similarities to the STAPLE masks for 41.5% of HippMapp3r masks, 24.9% of e2hipseg masks, 17.1% of FastSurfer masks, 15.1% of AssemblyNet masks, 6.1% of QuickNat masks, and 3.9% of Hippodeep masks, all shown as dark red bars in Fig. 2d.  www.nature.com/scientificreports/ A representative example of the segmentation results can be seen in Fig. 3. Albeit FastSurfer had the highest volumetric similarity to the STAPLE masks, the result did not reveal the highest DICE score compared, unveiling an undersegmented hippocampal head and oversegmented tail compared to the STAPLE mask (see 3D rendering, Supplementary Fig. S1). Figure 4 shows the MDS maps for volumetric similarity, DICE score, HD95, and AVGHD based on the mean values and the residual plots. The maps allow interpreting not only the mean (dis)similarity between the algorithms and the STAPLE mask but also the similarity of two segmentation algorithms by the proximity of these algorithms and the difference by distance. For example, the maps showed that the mean segmentation results of the e2dhipseg algorithm are more similar to the mean AssemblyNet segmentation masks than to the Hippodeep segmentation results (for all evaluation metrics).
Residual plots and the stress value provides the goodness-of-fit statistic of the MDS plots. The best fit was found for volumetric similarity. For the other two metrics, some short distances were underestimated in the MDS maps.
Subgroup analyses for hemispheric stroke lesions. All algorithms revealed a smaller segmentation volume for the ipsilesional side compared with the opposite side (Supplementary Table S2 online).
The agreement analysis with the STAPLE segmentation masks showed a significant decrease in similarities in all three metrics (lower volumetric similarities and DICE score and higher Hausdorff distance) for HippMapp3r, Hippodeep, FastSurfer, and AssemblyNet (Supplementary Table S3 online). For the QuickNat segmentation results, the performance decrease was observed only for DICE score and Hausdorff distance. Instance-based similarity classification for similarity ranks to STAPLE masks. Equal values were both assigned to the inferior category to avoid additional intermediary categories. Bark blue color with the highest similarity to STAPLE mask, red color with the lowest similarity. www.nature.com/scientificreports/ Figure 5 shows Spearman's rank correlation between the evaluation metrics and the stroke volume. Hippodeep and FastSurfer algorithms showed a moderate association of stroke volume and similarity to the STAPLE segmentation (absolute R values above 0.2) with a significantly increasing poorer performance with increasing lesion size. QuickNat segmentation showed this association for average HD and HD95 and to a lesser extent for the DICE score, no association was detected for volumetric similarity.
Comparing with manual tracings in a small subset. In a smaller subset of 30 patients, the similarity to manual trancing was assessed. All results for the subgroup analysis with the manual trancing and FreeSurfer segmentation can be found in the supplement (Supplementary Table S4).
In 3 of 30 patients, the FreeSurfer algorithms failed to produce the segmentation on the lesion side (5%). For all mean evaluation metrics, the STAPLE masks showed the best results with a VS of 0.979 (0.016), DICE score of 0.979 (0.017), average HD of 0.023 (0.024), and HD95 of 2.11 (3.779), standard deviations in parenthesis. Supplementary Fig. S2 with the instance-based similarity classification to the manual segmentations. The rank distribution revealed a similar pattern to those obtained in the total dataset. Hippodeep and QuickNat with the best results among the deep learning-based methods. Interestingly, the volumetric similarity showed heterogeneity among the segmentation algorithms. FreeSurfer segmentation did not show high ranks regarding DICE, HD95, and average HD.

Discussion
Reproducible and accurate image segmentation of in vivo magnetic resonance imaging is crucial for the reliable establishment of putative image biomarkers to improve the diagnostic and therapeutic decision-making processes [25][26][27] . In general, deep learning-based segmentation is not always more accurate, but more reproducible than human raters. Even though the application is simple and fast, the performance of these algorithms might be severely hindered by domain shift, which is rooted in differences between test data and training data used during   22 ) by exposing them to a dataset with structural signal alterations due to a chronic ischemic stroke lesion 29 . The applied and highly automated workflow enables an objective examination of (dis)similarity between the different segmentation results.
Even though the different segmentation algorithms were not developed for stroke patients, they showed a high success rate in segmenting the hippocampus with only a few missing segmentations. However, as expected, the volumetric results were not interchangeable ( Fig. 1) and should be interpreted with caution 30 .
Stroke lesions rarely affect the hippocampus directly 5 , but all methods revealed a smaller volume on the side of the stroke lesion compared to the opposite side, a fact already known from the literature due to secondary neurodegeneration after the initial event [31][32][33][34] .
For this work, we used a statistical consensus method to generate a case-based virtual ground truth segmentation using the simultaneous truth and performance level estimation (STAPLE) algorithm. The usage of consensus methods-as a combination of several segmentation methods-improves the segmentation performance compared to a single method by compensating for the weaknesses of individual methods [35][36][37][38] .
In contrast to traditional methods (creating ground truth using manual segmentation), the methodological approach used in this work (creating a virtual GT mask by the STAPLE algorithm using only deep learningbased input segmentation) provides an increased reproducibility by avoiding subjective delineation or manual changes. Although the STAPLE algorithm is not entirely independent of the individual input segmentation masks, it allows for the assessment of the independent and individual contributions of each segmentation mask and method on an instance basis [38][39][40] . For a small subset, we could show that the STAPLE masks showed good results compared to manual tracing.
Segmentation methods achieved good performance compared to the STAPLE segmentation. Comparisons between the evaluation metrics did not identify a single algorithm that performed best. Instead, the algorithms showed inconsistency across different metrics, with three methods performing particularly well: FastSurfer 21 , QuickNat 18 , and Hippodeep 17 .
FastSurfer 21 has demonstrated the best mean volumetric similarity and performs well in terms of CCC (concordance correlation coefficient). This algorithm is recommended for clinical applications where volumetric analysis alone is of primary interest. It could be employed in large-scale studies or clinical routine settings to reliably quantify hippocampal volume changes in stroke patients, which makes it a valuable tool for monitoring www.nature.com/scientificreports/ post-stroke dementia and disease progression. Interestingly, higher volumetric similarity does not necessarily indicate better overlap (DICE score), which suggests a misalignment of the segmentation results due to local over-and undersegmentation (e.g., the FastSurfer segmentation result in Fig. 3 and Supplementary Fig. 1). QuickNat 18 has demonstrated excellent performance in terms of DICE score, making it particularly suitable for applications that require precise anatomical delineation. Clinically, this algorithm could be utilized in studies focusing on the analysis of functional hippocampal changes (e.g., extraction perfusion values) and their correlation with cognitive decline in post-stroke patients.
Hippodeep 17 has shown favorable performance in segmenting the hippocampus in stroke patients. While its volumetric similarity may not be as high as other algorithms, it excels in terms of mean Hausdorff distance ( Table 2). This algorithm could be particularly valuable in clinical applications that require precise localization and boundary delineation of the hippocampus in stroke patients. For example, it could be employed in studies investigating the impact of stroke lesions on hippocampal shape or asymmetry, which could further increase our understanding of neurodegenerative processes in post-stroke patients.
Therefore, we advise choosing the most suitable segmentation method depending on the specific research question. Further post-processing steps, similar to those included in the e2dhipseg algorithm 19 , could enhance the distance measure by automatically eliminating small, distant voxels not connected to the two main volumes. www.nature.com/scientificreports/ Acceptability from a clinical and scientific point of view should always be considered in the context of the research question. For example, oversegmentation can lead to a significant bias of extracted perfusion values, because neighboring anatomical structures, such as the choroid plexus have significantly higher perfusion values. Additionally, systematic bias, such as the oversegmentation of the smaller hippocampus on the lesion side, could pose a problem. However, analyzing this aspect is beyond the scope of the current project and warrants further investigation.
While deep learning-based segmentation methods have produced satisfactory results overall, the high-performing methods have demonstrated better performance and less variation when it comes to segmenting the contralateral side in comparison to the lesion side. This suggests that the algorithms may have limited generalization capability to the lesion side, likely due to their training. Additional evidence was provided by the result of the correlation analysis which showed a decreased segmentation performance with increasing lesion size only on the side of the lesion whereas these correlations were not evident for the segmentation masks on the opposite side (Fig. 5). The lack of robustness caused by the stroke lesion can be caused by the domain shift, but also by unadjusted or imprecise preprocessing steps, for example, due to internal registration processes. Further research is needed to optimize the segmentation results for the lesion side. Among the high-performing methods, the effect was the lowest for QuickNat segmentations with comparable performance in volumetric similarity, but differences in DICE score, suggesting a pronounced misalignment to the STAPLE mask on the lesion side.
To our knowledge, only one study 41 examined automatic deep learning-based hippocampal segmentation in stroke patients showing better performance for the Hippodeep segmentation algorithm compared to the Free-Surfer segmentation, a traditional atlas-based method. The authors used a quality rating of volume by calculating over-and undersegmentation, but to correctly describe the performance of a segmentation method, several evaluation metrics are mandatory 42,43 . The use of various evaluation metrics is the prerequisite for morphological and functional analysis of anatomical structures, e.g., for the extraction of radiomic features 44 , where the exact delineation of a structure, considering the shape of the structure and its alignment, is essential.
Recently an updated version of FastSurfer was proposed, FastSurferVINN 45 , which might further improve the segmentation performance, but at the time of this publication, no code was available. Due to the public availability of the dataset, the analysis will be expendable for performance analysis of upcoming, novel segmentation algorithms.
Our work had some limitations. (1) Using a common agreement approach on data with an unknown ground truth segmentation assumes that the segmentation errors of the algorithms are uncorrelated. This assumption seems to be credible because all algorithms are based on different networks and trained on different datasets, which underlines the independence of these networks and minimizes systematic errors. (2) Common agreement methods can only approximate the "true" ground truth segmentation, its exact anatomical delineation is beyond the scope of this study and reserved for pathological or high-resolution imaging studies. Therefore, the use of the STAPLE mask can only determine the precision of a segmentation method, the true segmentation accuracy remains hidden. Given the higher variability of the segmentation results on the lesion side, the STAPLE maps could present reduced accuracy for the lesion side. However, all final STAPLE segmentation masks showed volumes in the expected physiological range without pronounced error detected by visual inspection and the expected reduced volume for the lesion side. (3) Due to the two-dimensionality of the MDS maps, visualizations showed a high-stress value. The use of additional dimensions would reduce the stress and thus increase the goodness-of-fit but hamper their interpretation. (4) Even if the general performance of the segmentation algorithm is good, no conclusions can be drawn about the case-specific performance, and individual segmentation masks may be erroneous. The prediction of case-specific confident values for the segmentation quality would help to determine the individual segmentation performance 46 . (5) For this work we used the training set of the ATLAS dataset, the sample size was not determined by a power analysis. Further, the use of only stroke patients may limit the generalizability of the finding to other patient populations. (6) We could not give detailed information on the processing time, but the corresponding requirements of the different segmentation algorithms are provided in the original publication.
The cross-domain transferability of six pretrained hippocampal segmentation networks was tested using a common agreement method. The analysis could not reveal one outperforming segmentation method, but rather various high-performing methods depending on the used evaluation metric. However, the overall performance of these methods on the lesion side showed higher variability and lack of robustness depending on the lesion size. Therefore, the best segmentation method should be chosen depending on the corresponding evaluation metric and the research question. For the application of ipsilesional hippocampal segmentation, additional training of new or existing segmentation networks with stroke datasets will further improve the cross-domain generalization of segmentation results. Currently, consensus methods can help optimize segmentation results on the lesion side.
In sum, accurate hippocampal segmentation will help reliably process large imaging datasets, facilitating large-scale stroke rehabilitation research with an appropriate sample size 47 . It will be ideal for automated integration in clinical routine to reveal subtle changes in the hippocampus and provide a basis for further research on post-stroke dementia 48 .
Further research is needed to optimize the quality of hippocampal segmentation in stroke patients.

Methods
Dataset. We used the Anatomical Tracing of Lesions after stroke (ATLAS) dataset release R2.0 29 . As described previously by Liew 29 , the dataset is derived from 33 cohorts worldwide. Only the training data was included, where manually annotated T1-weighted MRI sequences of chronic lesions after ischemic stroke were available, to investigate the dependence of hippocampal segmentation on the stroke lesion side, in total n = 655 patients. Each scan contained at least one lesion. www.nature.com/scientificreports/ The sample size for the present analysis was determined by the availability of MRI data and was not derived from a power calculation.
An experienced observer (M.S. a board-certified radiologist) visually inspected all images for artifacts and hippocampal involvement. Further images with an insufficient resolution were removed, and a threshold was set to a voxel dimension of at least 1.5 mm. All T1-weighted images were reoriented to the standard orientation (fslreorient2std, FMRIB software library, http:// fsl. fmrib. ox. ac. uk/ fsl/ fslwi ki/ FSL), followed by the segmentation process.
Segmentation methods. Six, recent, open-source hippocampal segmentation algorithms were used: • Three algorithms with hippocampal-only segmentation: • e2dhipseg including the recommended automatic orientation correction by rigid registration 19  • Three algorithms for whole-brain anatomical segmentation (only the hippocampal masks were used for further processing): • All segmentation algorithms were used with the default or recommended parameter settings. Supplementary Table S1 contains general information regarding the computational requirements and processing times. We highly recommend consulting the related publications for a more comprehensive understanding of these details.
Each segmentation mask of the e2dhipseg algorithm contained only a common mask for both hippocampi. Additional postprocessing steps were added to divide the mask into the left and right hippocampus by splitting the MRI image in the mid-sagittal slice, followed by visual inspection and manual correction for cases with decentered or rotated images.
Hippodeep outputs a probability mask and was further thresholded at 0.5 as recommended 17 . For deploying the FastSurfer and QuickNat algorithms an additional preprocessing step was needed, all T1w-images were standardized using the following command from FreeSurfer (mri_convert -conform), this resamples the image to isotropic resolution (256 × 256 × 256) with some contrast enhancement. The resulting masks were back transferred to the original image space by rigid registration with 6 degrees of freedom using nearest neighbor interpolation to make the segmentation masks comparable to the other segmentation algorithms.
Further, a subgroup of n = 30 patients was selected using the FairSubset library 49 . Due to the possible bias due to the stroke volume, the "best" subset of patients with hemispheric stroke lesions was selected, controlled for the distribution of stroke volume. Manually segmented ground truth images were generated by M.F. a medical resident with 5 years' experience. Additionally, a traditional segmentation approach was performed using FreeSurfer segmentation, version 7.1.1 14,15 . The resulting masks were back-transformed in the original patient space and compared to the manual trancing masks.
Generation of ground truth. For each hippocampus a virtual ground truth image was generated using an expectation-maximization algorithm for simultaneous truth and performance level estimation (STAPLE) 39 , implemented in SimpleITK Release 2.0 50 with a wrapper for Python (STAPLEImageFilter). The algorithm uses an iterative voting process to assign individual weights to each segmentation mask to compute probabilistic estimates of the "true" underlying segmentation. For each instance, the weights of the input segmentation masks are different. The binary STAPLE segmentation masks were generated using a probability threshold of 0.999. All STAPLE masks were visually inspected to detect erroneous estimations. No manual editing was applied at any stage of the process to ensure reproducibility. Performance evaluation. The following quantitative parameters were extracted to convert the segmentations into mineable data: • Volume of the segmented hippocampus was extracted, • Segmentation success rate (defined as completed segmentation and a hippocampal volume above zero) was estimated and • Common evaluation metrics 51 obtained using the EvaluateSegmentation tool 42 available at http:// github. com/ codal ab/ Evalu ateSe gment ation for assessing inter-(dis)similarities between the six different segmentation algorithms and to the STAPLE ground-truth mask. The following four metrics were used: • Volumetric similarity (VS) to detect volume change, • DICE score as a spatial overlap-based metric to detect alignment errors, First, the resulting masks were compared with the STAPLE ground truth, segmentation volumes and performances were visualized. The similarity in the extracted volumes was analyzed using the concordance correlation coefficient (CCC). Concordance was classified as poor (0.00-0.20), fair (0.21-0.40), moderate (0.41-0.60), good (0.61-0.80), or excellent (0.81-1.00) 52 .
For each individual case and evaluation metric, segmentation results were classified into six categories for their similarity to the STAPLE mask. This analysis results in a case-by-case similarity ranking of the algorithms with respect to the STAPLE segmentation.
Further subgroup analyses were performed including only cases with hemispheric stroke lesions to compare the ipsi-and contralesional hippocampal segmentation result: (1) groups are compared with paired t-tests and (2) Spearman correlation to determine the relationship of evaluation metrics and stroke volume for both groups.
Finally, the mean (dis)similarities between the algorithms were visualized using metric multidimensional scaling (MDS) as a dimension reduction technique. For this purpose, the corresponding pairwise Euclidean distances were calculated for all segmentation pairs. The evaluation metrics for each pair were first averaged within a patient (across hemispheres) and then across all patients to generate mean evaluation metrics. For volumetric similarity and DICE score, all values were previously subtracted from 1 to obtain dissimilarity measures. The final distance matrices were determined using the cmdscale function 53 to find the best-fitting two-dimensional representation of all mean segmentation algorithm results.
The resulting MDS maps visualized (dis)similarity between the mean segmentation results so that the distances among each pair of points correlate as best as possible to the dissimilarity between those two algorithms. Please note, the orientation of the MDS maps is entirely arbitrary and does not contain any information. For better comparison, the maps were centered on the STAPLE algorithm and rotated. Residual plots and stress values were depicted to display the goodness of fit to conform the configuration of the MDS map to the mean distance matrices.
To improve visualization, Delaunay triangulation was generated for the MDS maps to connect the most similar algorithms by an edge, using the tri.mesh function of the interp package 1.1-3 54 implemented in R.

Data availability
Raw data in native space are available on the Archive of Data on Disability to Enable Policy and Research (ADDEP, https:// doi. org/ 10. 3886/ ICPSR 36684. v4). Requests to access the processed masks should be directed to M.S., marianne.schell@med.uni-heidelberg.de. Code and extracted values will be available at http:// www. neuro AI-HD. org upon acceptance.